LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

Authors

  • Wang Ding
  • Songnian Yu
  • Shanqing Yu
  • Wei Wei
  • Qianfeng Wang
Abstract

The task of Text Classification (TC) is to automatically assign natural-language texts to thematic categories from a predefined category set. Latent Semantic Indexing (LSI) is a well-known technique in Information Retrieval, especially for dealing with polysemy (one word can have different meanings) and synonymy (different words can describe the same concept), but it is not an optimal representation for text classification. Applied to the whole training set (global LSI), it tends to degrade classification performance, because this completely unsupervised method concentrates on representation and ignores class discrimination. Several local LSI methods have been proposed to improve classification by exploiting class discrimination information, but their improvements over the original term vectors are still very limited. In this paper, we propose a new local LSI method called "Local Relevancy Ladder-Weighted LSI" (LRLW-LSI) to improve text classification. A separate singular value decomposition (SVD) is used to reduce the dimensionality of the vector space on the transformed local region of each class. Experimental results show that our method outperforms both global LSI and traditional local LSI methods while using a much smaller LSI dimension.
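To make the local-LSI idea concrete, here is a minimal Python sketch of a classifier in the spirit of the approach described above: for each class, a local region of the most relevant training documents is selected, those documents are weighted step-wise ("ladder"-style) by relevance, a separate SVD is computed on that weighted region, and a test document is assigned to the class whose local space reconstructs it best. The specific ladder weights, the simulated relevance scores, and the residual-based decision rule are illustrative assumptions; the paper's exact formulas are not given in this abstract.

import numpy as np

def ladder_weights(relevance, n_steps=3, weights=(1.0, 0.7, 0.4)):
    # Sort documents by relevance to the class (most relevant first) and split
    # them into n_steps groups ("ladders"); each group shares one weight.
    # The weight values here are an assumption, not the paper's formula.
    order = np.argsort(-relevance)
    w = np.empty(relevance.shape[0], dtype=float)
    for step, idx in enumerate(np.array_split(order, n_steps)):
        w[idx] = weights[step]
    return w

def build_local_lsi(X, relevance, k=50, region_size=200):
    # Local region: the documents most relevant to this class.
    top = np.argsort(-relevance)[:region_size]
    # Ladder-weight the local region, then run a separate SVD on it alone.
    local = X[top] * ladder_weights(relevance[top])[:, None]
    U, S, Vt = np.linalg.svd(local, full_matrices=False)
    return Vt[:k]          # projection from term space into this class's latent space

def classify(doc, local_spaces):
    # Assign the class whose local LSI space reconstructs the document best
    # (smallest residual after folding the document into the space).
    def residual(Vk):
        z = Vk @ doc
        return np.linalg.norm(doc - Vk.T @ z)
    return min(local_spaces, key=lambda c: residual(local_spaces[c]))

# Toy usage: 500 documents, 1000 terms, 3 classes; the relevance scores are
# simulated here, but would normally come from an initial flat classifier.
rng = np.random.default_rng(0)
X = rng.random((500, 1000))
labels = rng.integers(0, 3, size=500)
local_spaces = {
    c: build_local_lsi(X, (labels == c).astype(float) + 0.1 * rng.random(500))
    for c in range(3)
}
print(classify(X[0], local_spaces))

Because each SVD is computed only on a small, class-specific region, the number of latent dimensions k can stay far below what a global LSI of the whole training set would need, which is consistent with the abstract's point about achieving better classification within a much smaller LSI dimension.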

Similar Resources

Spam Filtering using Contextual Network Graphs

This document describes a machine-learning solution to the spam-filtering problem. Spam filtering is treated as a text-classification problem in a very high-dimensional space. Two new text-classification algorithms, Latent Semantic Indexing (LSI) and Contextual Network Graphs (CNG), are compared to existing Bayesian techniques by monitoring their ability to process and correctly classify a series of...

A new differential LSI space-based probabilistic document classifier

We have developed a new effective probabilistic classifier for document classification by introducing the concept of differential document vectors and DLSI (differential latent semantic indexing) spaces. A combined use of the projections on and the distances to the DLSI spaces introduced from the differential document vectors improves the adaptability of the LSI (latent semantic indexing) metho...

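As a rough illustration of the differential-LSI idea mentioned in the entry above (not the authors' exact classifier, which is probabilistic), the following Python sketch builds a low-rank space from differential document vectors, taken here as documents minus a class centroid, and scores a new document by combining its projection onto that space with its residual distance from it. Both the centroid-based differentials and the scoring rule are assumptions for illustration.

import numpy as np

def dlsi_space(docs, centroid, k=20):
    # SVD basis of the differential document vectors (documents minus a centroid).
    D = docs - centroid
    U, S, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:k]                      # rows span the differential (DLSI) space

def dlsi_score(doc, centroid, Vk):
    # Combine the projection onto the DLSI space with the distance to it.
    d = doc - centroid
    proj = Vk @ d                      # component inside the space
    resid = d - Vk.T @ proj            # component outside the space
    return np.linalg.norm(proj) - np.linalg.norm(resid)

# Toy usage: one DLSI space per category, pick the best-scoring category.
rng = np.random.default_rng(1)
docs_by_class = {c: rng.random((100, 300)) for c in range(2)}
spaces = {c: (d.mean(axis=0), dlsi_space(d, d.mean(axis=0)))
          for c, d in docs_by_class.items()}
query = rng.random(300)
print(max(spaces, key=lambda c: dlsi_score(query, *spaces[c])))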

A Content Vector Model for Text Classification

As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications. In this paper, an LSI-based content vector model for text classification is presented, which constructs multiple augmented category LSI spaces and classifies text by their content. The model integrates the class discriminative information from the traini...

Latent Semantic Indexing for Patent Documents

Since the huge database of patent documents is continuously growing, the issue of classifying, updating and retrieving patent documents has become an acute necessity. Therefore, we investigate the efficiency of applying Latent Semantic Indexing, an automatic indexing method of information retrieval, to some classes of patent documents from the United States Patent Classification System. We ...

Combining the Classifiers and LSI Method for Efficient and Accurate Text Classification

Text classification involves the assignment of predetermined categories to textual resources. Applications of text classification include recommendation systems, personalization, help desk automation, content filtering and routing, selective alerting, and training. This paper describes an experiment for improving the classification accuracy of a large text corpus by the use of dimensionality reduct...

Journal:

Volume:   Issue:

Pages:  -

Publication date: 2008